-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce Validation and Proposal message relaying #3412
Conversation
gregtatcam
commented
May 21, 2020
- The motivation for the reduce relay is to reduce CPU and bandwidth cost and improve link latency.
- Every node independently goes through the peer selection rounds at some random intervals.
- MAX_PEERS are randomly selected from the pool of directly connected peers with the Validation and Proposal message count from a validator in {MESSAGE_LOW_THRESHOLD,MESSAGE_UPPER_THRESHOLD}. The peers are chosen to be the source of Validation and Proposal messages from this validator. Protocol message is sent to the rest of directly connected peers instructing them to squelch (stop relaying) the message for some random period of time in {MAX_UNSQUELCH_EXPIRE,MIN_UNSQUELCH_EXPIRE}.
- When the squelch expires or when a new peer connects to the node, another peer selection round starts.
- When a peer, which is selected to be the source of the messages for the given validator disconnects, or is idled then unsquelch message is sent to all squelched peers and another peer selection round starts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a quick first pass and had some minor comments. I will get better understanding of the overlay relay code, then finish the review later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor feedback on the structure of the PR:
It's easier for reviewers if the commits are organized into logical units. This PR has 70 commits, and those commits reflect the way the feature was developed. As a reviewer, I'm less interested in the timeline of how the feature was developed; I'm more interested in "here's a patch with all the XXX changes".
The biggest issue in this commit is making the Slots
and Squlech
classes generic (especially Slots
). For non-test code they do not need to be generic and do not need callback functions; It would be a HUGE win for code understandably if we could somehow test this without making those classes generic.
src/ripple/overlay/Slot.h
Outdated
* doesn't count messages in this state but a received message | ||
* may transition Slot to Counting state. | ||
*/ | ||
template <typename Peer, typename clock_type> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the motivation to make this class generic - it help with testing. However, making this generic makes the code much harder to navigate and understand. We really need to think hard if there's another way to test the code that doesn't necessitate making this class generic.
This is not just a nit to me. Making this class use the concrete peer type, getting rid of all the generic callback functions, and getting rid of the comparator parameter would be a huge improvement in code understandably.
I do know we did something similar in the consensus code. Please think hard if there's another, better solution to the testing problem;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can eliminate the template all together. The idea is to introduce SquelchHandler
abstract class, which declares the squelch/unsquelch handlers. Then implementation in Overlay
and UnitTest
encapsulate specific object Peer
type and implement the handlers. We can provide on a fly handler for UnitTest
where we want to have some flexibility. The only disadvantage that I see so far of this approach is that instead of storing Peer
in the Slot
we are storing SquelchHandler
so we have to instantiate this object for every Peer
and validator. So the additional memory footprint is nValidators * nPeers. The concept is explained in this code example: https://gist.github.com/gregtatcam/4c2a680ab3e3cd5095f130bfb329e19f
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First, thanks for diving into this.
There are a couple of ways we could try to solve this:
-
Templates and callbacks (the current approach). My biggest objection is ability to navigate and understand the code.
-
New base class for Overlay. I haven't coded this up, but I suspect there isn't enough shared implementation between the test and non-test code. So we'll end up testing something very different from the production code.
-
variant
based wrapper. I like the concreteness of this, but the test code requires several wrapped types. This solution works best if there was only one (or very few) wrapped types. -
type erased wrapper (the approach in your patch). This requires an extra allocation, but it is an improvement over the template based approach from a code navigability POV. We may be able to remove the allocation with small object optimization techniques.
-
Write a test framework that can handle the Overlays and Peers. This by far gives the cleanest, most understandable code. But we (1) don't have that framework and (2) unit tests are still valuable.
Given the above, I vote that we go forward with your proposed patch. However, you might want to wait to hear from other reviewers to see if they have other ideas on this issue as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can implement Peer
class in unit tests with most of the methods simply returning some value. This is not going to be the test framework or a step towards implementing the test framework but a way to avoid the extra allocation in (4). We can then pass std::weak_ptr<Peer>
to all applicable slot methods as we do today with Peer
being a concrete type so no template is required. We will have abstract class SquelchHandler
with squelch
and unsquelch
methods. The class is going to be implemented in Overlay
and UnitTest
. Reference to the base class is going to be passed to Slots
and Slot
constructors and squelch
and unsquelch
called as needed. UnitTest
can overwrite the handling on a fly providing flexibility of different handlers as needed. Does this work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice job getting rid of the templates and callbacks.
I left another set of minor comments, but I suspect that's the last batch from me.
* unsquelched, disconnected peer, or idling peer may transition Slot to | ||
* Counting state. | ||
*/ | ||
template <typename clock_type> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that this class is simplified to take a single template parameter, consider moving the implementation to a .cpp
file and explicitly instantiating the clocks we need for linking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't we have to move ManualClock from the unit test to the Slot's implementation in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'd have to move the ManualClock to a common header file, and we normally would not include a test file like that in the main code (tho I think we have a good reason in this case). But let's not worry about that now, we can do that later in another PR if we want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather not have a test file in the main code. I think a better way is to get rid of the clock template parameter altogether, in another PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Let a couple minor comments that do need to be addressed (a typo and the order of includes), but those changes are so trivial I'm OK approving it now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried the code with several test cases. They worked fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed the latest patch, left some comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me confirm my understanding.
- we count and only count for relayed messages, no matter if the messages are trusted or not.
- for the same message, all the peers forwarded the message to the local node are recorded by the hashRouter.
- there are two code paths that peers are counted by the slots logic:
-- if they are recorded by the hashRouter before therelayed
timestamp is set, they are counted right after the local node try to relay the message. This also works for the case that the hashRouter has recorded all the peers (i.e. the message won't actually be sent.)
-- (note: thread safety (for the atomic operation between therelayed
timestamp and the set of peers) is protected by the mutex in hashRouter)
-- else (after therelayed
timestamp is set), the peers are counted when their messages are received.
-- (note: for both of the cases, peers are counted only if the message is relayed) - the container (peersWithMessage_) in slots is only for preventing over counting.
- if a peer forwards the message too late (8 seconds after the
relayed
timestamp was set), then the peer is not counted, which is OK.
* the message has been relayed. | ||
* @param key Unique message's key | ||
* @param validator Validator's public key | ||
* @param id Peers id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be peers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
src/ripple/app/misc/NetworkOPs.h
Outdated
* have already seen the messge; i.e. | ||
* the message has been received from | ||
* these peers and added to the hash | ||
* router. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is "Return the set of peers from which the local node has received the message." accurate?
It is existing behavior, but could you please put a note saying this function will relay valid proposal messages?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
src/ripple/overlay/Slot.h
Outdated
|
||
template <typename clock_type> | ||
void | ||
Slots<clock_type>::unsquelch(id_t const id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the benefit of this function? I see this function is called by OverlayImpl::deletePeer(Peer::id_t const id)
. I feel it might be easier to understand if deletePeer(id, true);
was called directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me confirm my understanding.
- we count and only count for relayed messages, no matter if the messages are trusted or not.
- for the same message, all the peers forwarded the message to the local node are recorded by the hashRouter.
- there are two code paths that peers are counted by the slots logic:
-- if they are recorded by the hashRouter before therelayed
timestamp is set, they are counted right after the local node try to relay the message. This also works for the case that the hashRouter has recorded all the peers (i.e. the message won't actually be sent.)
-- (note: thread safety (for the atomic operation between therelayed
timestamp and the set of peers) is protected by the mutex in hashRouter)
-- else (after therelayed
timestamp is set), the peers are counted when their messages are received.
-- (note: for both of the cases, peers are counted only if the message is relayed)- the container (peersWithMessage_) in slots is only for preventing over counting.
- if a peer forwards the message too late (8 seconds after the
relayed
timestamp was set), then the peer is not counted, which is OK.
Yes, this is correct understanding on all points.
* unsquelched, disconnected peer, or idling peer may transition Slot to | ||
* Counting state. | ||
*/ | ||
template <typename clock_type> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather not have a test file in the main code. I think a better way is to get rid of the clock template parameter altogether, in another PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Looks good! (left one minor comment about a journal param, but marking this approved now).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!
3577885
to
49256bd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see anything obviously wrong with this PR, but I think the code is a bit hairy. This isn't a criticism of your changes @gregtatcam; the code was already hairy when you started working on this. Separately from this PR, we should try to revisit how we do some things, especially the HashRouter
.
src/ripple/overlay/Message.h
Outdated
Message( | ||
::google::protobuf::Message const& message, | ||
int type, | ||
boost::optional<PublicKey> validator = {}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please document this parameter in the comment above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is documented. This is what it currently states:
@param validator Public Key of the source validator for Validation or Proposal message. Used to check if the message should be squelched.
src/ripple/overlay/impl/PeerImp.cpp
Outdated
{ | ||
if (!m->has_validatorpubkey()) | ||
{ | ||
JLOG(p_journal_.debug()) << "onMessage: TMSquelch, missing public key"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't be logging such malformed messages; rather we should just charge the peer for sending us bad data and we should drop the message on the floor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
src/ripple/core/Config.h
Outdated
@@ -179,6 +179,12 @@ class Config : public BasicConfig | |||
// Thread pool configuration | |||
std::size_t WORKERS = 0; | |||
|
|||
// Reduce-relay |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a comment that these parameters are experimental.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
src/ripple/overlay/Overlay.h
Outdated
/** Relay a validation. */ | ||
virtual void | ||
relay(protocol::TMValidation& m, uint256 const& uid) = 0; | ||
/** Relay a proposal. Return set |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prefer doxygen format:
/** Relay a proposal.
@param m the serialized proposal
@param uid the id used to identify this proposal
@param validator The pubkey of the validator that issued this proposal
@return the set of peers which have already sent us this proposal
*/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -99,6 +99,7 @@ OverlayImpl::Timer::on_timer(error_code ec) | |||
overlay_.m_peerFinder->once_per_second(); | |||
overlay_.sendEndpoints(); | |||
overlay_.autoConnect(); | |||
overlay_.deleteIdlePeers(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this really need to run every second?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to once every 4 seconds. A peer is considered idle if no message is received in 8 seconds.
src/ripple/app/misc/HashRouter.cpp
Outdated
@@ -49,12 +49,19 @@ HashRouter::addSuppression(uint256 const& key) | |||
|
|||
bool | |||
HashRouter::addSuppressionPeer(uint256 const& key, PeerShortID peer) | |||
{ | |||
auto [added, _] = addSuppressionPeerWithStatus(key, peer); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd simply: return addSuppressionPeerWithStatus(key, peer).first;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
} | ||
|
||
std::pair<bool, std::optional<Stopwatch::time_point>> | ||
HashRouter::addSuppressionPeerWithStatus(const uint256& key, PeerShortID peer) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The HashRouter
API was already confusing enough. It's now become a little too confusing. I don't have an improvement to offer, but we need to think about this.
A server may receive multiple copies of a message from its directly connected peers. We select three peers as the source of validation and proposal messages from the validator and "squelch" other peers. This change reduces CPU and bandwidth costs and improves link latency.
Charge for bad TMSquelch Update comments
216c6e8
to
8ec800f
Compare